Licensed under the Creative Commons attribution-noncommercial license. Please share and remix noncommercially, mentioning its origin.
This document was produced using pomp version 1.4.7.2 and R version 3.2.4.
This tutorial aims to help you get started using pomp as a suite of tools for analysis of time series data based on dynamical systems models. First, we give some conceptual background regarding the class of models—partially observed Markov processes—that pomp handles. We then discuss some preliminaries: installing the package and so on. Next, using a basic question about ecological population regulation as an example, we load some data and implement some models as R objects of class pomp. Finally, we illustrate some of the package’s capabilities by using its algorithms to fit and compare the models using various inference methods.
As its name implies, pomp is focused on partially observed Markov process (POMP) models. These are also known as state-space models, hidden Markov models, or stochastic dynamical systems models. Such models consist of an unobserved stochastic state process connected to the data via an explicit model of the observation process. We refer to the former as the process model and the latter as the measurement model. Each model is, in fact, a probability distribution. If we let \(Y_t\) denote the measurement process at time \(t\) and \(X_t\) the state process, then by definition the process model is determined by the density \(f_{X_t|X_{t-1}}\) and the measurement model by the density \(f_{Y_t|X_t}\). We think of the data, \(y^*_t\), \(t=1,\dots,T\), as a single realization of the \(Y\) process. To implement a POMP model in pomp, we have to specify the measurement and process distributions.
Note, however, that for each of the process model and the measurement model there are two distinct operations we might wish to perform:
We refer to the simulation of \(f_{X_t|X_{t-1}}\) as the rprocess component of the POMP model, the evaluation of \(f_{X_t|X_{t-1}}\) as the dprocess component, the simulation of \(f_{Y_t|X_t}\) as the rmeasure component, and the evaluation of \(f_{Y_t|X_t}\) as the dmeasure component. Methods that make no use of the dprocess component are called “plug-and-play” methods. At present, pomp is focused on such methods, so we need not consider the dprocess component further. In the following, we will explain how one specifies the rprocess, rmeasure, and dmeasure components of a model in pomp, using some simple examples.
The following is a schematic of the structure of a POMP showing causal relations between the process model, the measurement model, and the data. The key perspective to keep in mind is that the model is to be viewed as the process that generated the data.
Here is another POMP model schematic, showing dependence among model variables.
State variables, \(X\), at time \(t\) depend only on state variables at the previous timestep. Measurements, \(Y\), at time \(t\) depend only on the state at that time. Formally, \[\mathrm{Prob}[X_t|X_0,\dots,X_{t-1},Y_1,\dots,Y_{t-1}]=\mathrm{Prob}[X_t|X_{t-1}]\] and \[\mathrm{Prob}[Y_t|X_0,\dots,X_{t},Y_1,\dots,Y_{t-1}]=\mathrm{Prob}[Y_t|X_{t}],\] for all \(t=1,\dots,T\).
To get started, we must install pomp, if it is not already installed. The package can be downloaded from CRAN, but the latest version is always available at the package homepage on Github. See the package website for installation instructions.
In this document, we will implement pomp models using the package’s Csnippet facility. This allows the user to write model components using snippets of C code, which are then compiled and linked into a running R session. This typically affords a manyfold increase in computational speed. It is possible to avoid Csnippets entirely by writing model components as R functions, but the resulting implementations are typically too slow to be of practical use. To use Csnippets, you must be able to compile C code. Compilers are not installed by default on Windows or Mac systems, so users of those systems must do a bit more work to make use of pomp’s facilities. The installation instructions on the package website give details.
We will illustrate pomp by performing a limited data analysis on a set of bird abundance data. The data are from annual censuses of a population of Parus major (Great Tit) in Wytham Wood, Oxfordshire. They were retrieved as dataset #10163 from the Global Population Dynamics Database version 2 (NERC Centre for Population Biology, Imperial College, 2010). The original source is McCleery and Perrins (1991).
We will use several packages in what follows; we load them, along with the data, by doing
library(pomp)
library(ggplot2)
library(magrittr)
library(plyr)
library(reshape2)
library(foreach)
parus.dat <- read.csv(text="
year,P
1960,148
1961,258
1962,185
1963,170
1964,267
1965,239
1966,196
1967,132
1968,167
1969,186
1970,128
1971,227
1972,174
1973,177
1974,137
1975,172
1976,119
1977,226
1978,166
1979,161
1980,199
1981,306
1982,206
1983,350
1984,214
1985,175
1986,211"
)
Here is a plot of these data:
ggplot(data=parus.dat,mapping=aes(x=year,y=P))+
geom_line()+geom_point()+
expand_limits(y=0)+
theme_classic()
To keep things simple, let’s attempt to explain these data using a very simple model of stable but stochastic population dynamics: the logistic, or Verhulst-Pearl, equation with environmental stochasticity. We’ll write this model as a stochastic differential equation (SDE), specifically an Itô diffusion: \[dN = r\,N\,\left(1-\frac{N}{K}\right)\,dt+\sigma\,N\,dW(t),\] where \(N\) is the population size, \(r\) is a fixed rate, the so-called “Malthusian parameter”, \(K\) is the population’s “carrying capacity”, \(\sigma\) describes the intensity of extrinsic stochastic forcing on the system, and \(dW\) is an increment of a standard Wiener process. [Those unfamiliar with Wiener processes and Itô diffusions will find it useful to visualize \(dW(t)\), for each time \(t\), as a normal random variable with mean zero and standard deviation \(\sqrt{dt}\).]
To implement this model in pomp, we must tell the package how to simulate it. The easiest way to simulate such an SDE is via the Euler-Maruyama method. In this approximation, we take a large number of very small steps, each of duration \(\Delta t\). At each step, we hold the right-hand side of the above equation constant, compute \(\Delta N\) using that equation, and increment \(N\) accordingly. pomp provides the euler.sim function to assist in implementing the Euler-Maruyama method. To use it, we must encode the computations that take a single step. The best way to do this is to write a snippet of code in the C language. The following is such a snippet for the stochastic logistic equation above.
step.fun <- Csnippet("
double dW = rnorm(0,sqrt(dt));
N += r*N*(1-N/K)*dt+sigma*N*dW;
")
This is just a snippet of C code: not all the variables are declared and the context of the snippet is not specified. When given this snippet, pomp will provide the necessary declarations and context, compile the resulting C code, dynamically link to it, and use it to simulate realizations of the process model. We cause all this to happen when we construct an object of class pomp via a call to the constructor function, which is also named pomp:
parus <- pomp(data=parus.dat,time="year",t0=1959,
rprocess=euler.sim(step.fun=step.fun,delta.t=1/365),
statenames="N",paramnames=c("r","K","sigma"))
In the above, we’ve specified an Euler-Maruyama time-step of about one day. The t0 argument specifies that we are going to treat the stochastic state process as being initialized in year 1959. We use the statenames and paramnames arguments to indicate which of the undeclared variables in the Csnippet step.fun are state variables and which are fixed parameters. Since dW is a local variable, we provide an ordinary C declaration for it. Note that dt is a variable defined in the context of this snippet; it is provided by pomp and contains the size of the Euler step. The rnorm function is part of the R API: see the manual on “Writing R Extensions” for a description of this and the other distribution functions the R API provides. Finally, note that the state variable N is overwritten by this snippet: its value when the first line is executed is overwritten by the second line.
A full set of rules for writing pomp C snippets is given in the package help ?Csnippet.
With the process model in place, we can simulate the process model using the simulate function and plot several realizations for a given set of parameters.
simStates <- simulate(parus,nsim=10,params=c(r=0.2,K=200,sigma=0.5,N.0=200),states=TRUE)
Although it is the process model that is usually the focus of an investigator’s interest, the measurement model is equally important because it connects the process model to the data. For this example, let’s model the observations using a Poisson distribution: \[P_t \sim \mathrm{Poisson}(N_t),\] where, at time \(t\), \(P_t\) is our census count and \(N_t\) is the true population size.
The following snippet encodes the rmeasure component for this model.
rmeas <- Csnippet("
P = rpois(N);
")
In this snippet of code, N is the state variable and P is the name of our observable, as defined in the dataset parus.dat. The rpois function is part of the R API: it takes a single argument—the Poisson distribution’s parameter—and returns a pseudo-random draw from the Poisson distribution with that parameter. As with the step.fun snippet above, execution of this snippet causes P to be overwritten.
To fold this into our pomp object, we do
parus <- pomp(parus,rmeasure=rmeas,statenames="N")
Note that, again, we must tell pomp which of the variables in the snippet are state variables and which (if any) are parameters. We can now simulate from the full POMP model and plot the results.
sim <- simulate(parus,params=c(r=0.2,K=200,sigma=0.5,N.0=200),
nsim=10,obs=TRUE,states=TRUE)
The following snippet encodes the dmeasure component.
dmeas <- Csnippet("
lik = dpois(P,N,give_log);
")
and we fold it into the pomp object via
parus <- pomp(parus,dmeasure=dmeas,statenames="N")
In the dmeas snippet, dpois again comes from the R API. It takes three arguments, the datum, the Poisson parameter, and give_log. When give_log=0, dpois returns the Poisson likelihood; when give_log=1, dpois returns the log of this likelihood. When this snippet is executed, pomp will provide the value of give_log according to its needs. It is the user’s responsibility to make sure that the correct value is returned.
Since we now have both the rprocess and the dmeasure components in place, we can perform a particle filtering operation to estimate the log likelihood at any given point in parameter space. For example, we can do
pf <- pfilter(parus,Np=1000,params=c(r=0.2,K=200,sigma=0.5,N.0=200))
logLik(pf)
## [1] -152.1487
A deterministic skeleton of the logistic model is obtained by taking \(\sigma\to 0\), which leads to the deterministic differential equation \[\frac{dN}{dt} = r\,N\,\left(1-\frac{N}{K}\right).\] The following snippet encodes this deterministic skeleton and folds it into the pomp object.
skel <- Csnippet("
DN = r*N*(1-N/K);
")
parus <- pomp(parus,skeleton=skel,skeleton.type="vectorfield",statenames="N",paramnames=c("r","K"))
Note that in this snippet, DN is filled with the value of the time-derivative of N. See the package help (?Csnippet) for a complete set of rules for writing C snippets.
With the deterministic skeleton in place we can generate several trajectories of the skeleton using trajectory, as follows, and plot the result.
pars <- parmat(c(r=1,K=200,sigma=0.5,N.0=20),5)
pars["N.0",] <- seq(20,300,length=5)
traj <- trajectory(parus,params=pars,times=seq(1959,1970,by=0.01))
Let us demonstrate the implementation of discrete-time process models by replacing the continuous-time logistic equation used above with the stochastic Beverton-Holt model \[N_{t+1}=\frac{a\,N_t}{1+b\,N_t}\,\varepsilon_t,\] where \(a\) and \(b\) are parameters and \(\varepsilon_t\sim\mathrm{Lognormal}(-\tfrac{1}{2}\sigma^2,\sigma)\).
Since we are now dealing with a discrete-time model, euler.sim will no longer construct an appropriate rprocess component. Instead, we use discrete.time.sim, which also takes a C snippet encoding the one-step simulation. The following snippet simulates a single iteration of the stochastic Beverton-Holt map:
bh.step <- Csnippet("
double eps = rlnorm(-sigma*sigma/2,sigma);
N = a*N/(1+b*N)*eps;
")
A corresponding skeleton is the deterministic Beverton-Holt map, obtained by setting \(\varepsilon_t=1\) in the above equation. A snippet that implements this map is
bh.skel <- Csnippet("
DN = a*N/(1+b*N);
")
Note that, as in the continuous case, we indicate the skeleton’s output by prepending D to the state-variable name. For a map, DN holds the value of N after one iteration; for a vectorfield, it holds the time-derivative of N.
We create a new pomp object based on this process model but with the same rmeasure and dmeasure components as in parus by executing
parus.bh <- pomp(parus,rprocess=discrete.time.sim(bh.step,delta.t=1),
                 skeleton=bh.skel,skeleton.type="map",skelmap.delta.t=1,
                 statenames="N",paramnames=c("a","b","sigma"))
The following codes test the above by computing some simulations and a trajectory of the skeleton, and by running a particle filtering operation.
coef(parus.bh) <- c(a=1.1,b=5e-4,sigma=0.5,N.0=30)
sim <- simulate(parus.bh)
traj <- trajectory(parus.bh)
pf <- pfilter(parus.bh,Np=1000)
We can specify model-specific parameter transformations using C snippets. The following snippets implement log transformation of the \(r\), \(K\), and \(\sigma\) parameters of the logistic model and fold the transformations into the pomp object parus.
logtrans <- Csnippet("
Tr = log(r);
TK = log(K);
Tsigma = log(sigma);
")
exptrans <- Csnippet("
Tr = exp(r);
TK = exp(K);
Tsigma = exp(sigma);
")
parus <- pomp(parus,toEstimationScale=logtrans,
fromEstimationScale=exptrans,
paramnames=c("r","K","sigma"))
Trajectory matching is the method of fitting a deterministic model to data assuming independent errors. In pomp, the function traj.match searches parameter space to find parameters under which the likelihood of the data given a trajectory of the deterministic skeleton is maximized.
tm <- traj.match(parus,start=c(r=1,K=200,N.0=200,sigma=0.5),
est=c("r","K"),transform=TRUE)
signif(coef(tm),3)
## r K N.0 sigma
## 19.4 196.0 200.0 0.5
logLik(tm)
## [1] -276.4891
We can simulate the fitted model and compare it against the data.
coef(tm,"sigma") <- 0
simulate(tm,nsim=10,as.data.frame=TRUE,include.data=TRUE) %>%
ggplot(aes(x=time,y=P,group=sim,alpha=(sim=="data")))+
scale_alpha_manual(name="",values=c(`TRUE`=1,`FALSE`=0.2),
labels=c(`FALSE`="simulation",`TRUE`="data"))+
geom_line()
Iterated filtering (Ionides et al. 2006, 2015) is a method for maximizing the likelihood by repeatedly applying a particle filter.
The following codes apply the IF2 algorithm (Ionides et al. 2015).
bake(file="parus-mif.rds",{
guesses <- sobolDesign(lower=c(r=0,K=100,sigma=0,N.0=200),
upper=c(r=5,K=600,sigma=2,N.0=200),
nseq=100)
foreach (guess=iter(guesses,"row"),.combine=rbind,
.options.mpi=list(seed=334065675),
.packages=c("pomp","magrittr"),.errorhandling="remove") %dopar% {
parus %>%
mif2(start=unlist(guess),Nmif=50,Np=1000,transform=TRUE,
cooling.fraction.50=0.8,cooling.type="geometric",
rw.sd=rw.sd(r=0.02,K=0.02,sigma=0.02)) %>%
mif2() -> mf
ll <- logmeanexp(replicate(5,logLik(pfilter(mf))),se=TRUE)
data.frame(loglik=ll[1],loglik.se=ll[2],as.list(coef(mf)))
}
}) -> mles
In the above, we first use sobolDesign to construct a Sobol’ low-discrepancy design for the “starting points”, i.e., the points in parameter space where we will begin our searches for the MLE. At each of these (foreach), we first run 50 iterations of the IF2 algorithm (mif2). We then re-run the algorithm, starting from where the first run ended up (the second mif2). To obtain an estimate of the log likelihood at that (putative) MLE, we run 5 independent particle filtering operations (pfilter), averaging the likelihoods with proper attention paid to Jensen’s inequality (logmeanexp). This gives us both an estimate of the log likelihood at that point and an estimate of the Monte Carlo error in that estimate. A useful way of displaying the results of such a multi-start search is the pairwise scatterplot matrix:
pairs(~loglik+r+K+sigma,data=mles)
We can improve the quality of our estimates and obtain likelihood-ratio-test based confidence intervals by constructing profile likelihoods. In a likelihood profile, one varies the focal parameter (or subset of parameters) across some range, maximizing the likelihood at each value of the focal parameter over the remaining parameters. The following codes construct a likelihood profile over \(r\).
profileDesign(
r=seq(from=0.1,to=5,length=21),
lower=c(K=100,sigma=0.01,N.0=200),upper=c(K=600,sigma=0.2,N.0=200),
nprof=50
) -> pd
dim(pd)
## [1] 1050 4
pairs(~r+K+sigma+N.0,data=pd)
bake("parus-profile.rds",{
foreach (p=iter(pd,"row"),
.combine=rbind,
.errorhandling="remove",
.packages=c("pomp","magrittr","reshape2","plyr"),
.inorder=FALSE,
.options.mpi=list(seed=1680158025)
) %dopar% {
parus %>%
mif2(start=unlist(p),Nmif=50,Np=1000,transform=TRUE,
cooling.fraction.50=0.8,cooling.type="geometric",
rw.sd=rw.sd(K=0.02,sigma=0.02)) %>%
mif2() -> mf
pf <- replicate(5,pfilter(mf,Np=1000)) ## independent particle filters
ll <- sapply(pf,logLik)
ll <- logmeanexp(ll,se=TRUE)
nfail <- sapply(pf,getElement,"nfail") ## number of filtering failures
data.frame(as.list(coef(mf)),
loglik = ll[1],
loglik.se = ll[2],
nfail.min = min(nfail),
nfail.max = max(nfail))
} %>% arrange(r,-loglik)
}) -> r_prof
pairs(~loglik+r+K+sigma,data=r_prof,subset=loglik>max(loglik)-10)
r_prof %>%
mutate(r=signif(r,8)) %>%
ddply(~r,subset,loglik==max(loglik)) %>%
ggplot(aes(x=r,y=loglik))+geom_point()+geom_smooth()
Now let us plot the data and 10 simulated realizations of the model process on the same axes.
r_prof %>%
subset(loglik==max(loglik)) %>% unlist() -> mle
simulate(parus,params=mle,nsim=10,as.data.frame=TRUE,include.data=TRUE) %>%
ggplot(mapping=aes(x=time,y=P,group=sim,alpha=(sim=="data")))+
scale_alpha_manual(name="",values=c(`TRUE`=1,`FALSE`=0.2),
labels=c(`FALSE`="simulation",`TRUE`="data"))+
geom_line()
If we seek to do full-information Bayesian inference, we can use particle Markov chain Monte Carlo, implemented in pomp in the pmcmc function. First, we add a prior to the pomp object by furnishing a prior probability density evaluation function (dprior).
parus %<>%
pomp(dprior=Csnippet("
lik = dunif(r,0,5,1)+dunif(K,100,600,1)+dunif(sigma,0,2,1);
lik = (give_log) ? lik : exp(lik);
"),paramnames=c("r","K","sigma"))
The following codes cause parallel pMCMC chains to be run, each beginning at one of the estimates obtained using mif2.
bake(file="parus-pmcmc.rds",{
r_prof %>% ddply(~r,subset,loglik==max(loglik)) %>%
subset(K > 100 & K < 600 & r < 5 & sigma < 2,
select=-c(loglik,loglik.se)) -> starts
foreach (start=iter(starts,"row"),.combine=c,
.options.mpi=list(seed=23781975),
.packages=c("pomp","magrittr"),.errorhandling="remove") %dopar%
{
parus %>%
pmcmc(Nmcmc=2000,Np=200,start=unlist(start),
proposal=mvn.rw.adaptive(rw.sd=c(r=0.1,K=100,sigma=0.1),
scale.start=100,shape.start=100)) -> chain
chain %>% pmcmc(Nmcmc=10000,proposal=mvn.rw(covmat(chain)))
}
}) -> chains
In the above, we first retain only those mif2 estimates that are consistent with our prior. Starting at each of these, we perform a number of MCMC moves using an adaptive multivariate normal random-walk proposal distribution (mvn.rw.adaptive). To complete the inference, we perform additional MCMC iterations using a multivariate normal random-walk proposal (mvn.rw) with covariance matrix derived from the preceding computation (covmat).
Now we investigate the chains, to try to determine whether they have converged.
library(coda)
chains %>% conv.rec() -> traces
rejectionRate(traces[,c("r","K","sigma")])
## r K sigma
## 0.8089262 0.8089262 0.8089262
autocorr.diag(traces[,c("r","K","sigma")])
## r K sigma
## Lag 0 1.00000000 1.00000000 1.00000000
## Lag 1 0.91134072 0.89940901 0.90930642
## Lag 5 0.64729182 0.60764266 0.63608749
## Lag 10 0.44371738 0.39249861 0.42523417
## Lag 50 0.06048917 0.03652162 0.04221093
traces <- window(traces,thin=50,start=1000)
plot(traces[,"r"])
plot(traces[,"K"])
plot(traces[,"sigma"])
gelman.diag(traces[,c("r","K","sigma")])
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## r 1 1
## K 1 1
## sigma 1 1
##
## Multivariate psrf
##
## 1
Insofar as we can tell, the MCMC iterations appear to have converged to their stationary distribution. Let us now examine the posterior distribution. We will also simulate from the posterior median parameters.
summary(traces[,c("r","K","sigma")])
##
## Iterations = 1000:10000
## Thinning interval = 50
## Number of chains = 61
## Sample size per chain = 181
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## r 3.1412 1.1738 0.011171 0.013512
## K 211.7872 14.3722 0.136779 0.146503
## sigma 0.6311 0.1523 0.001449 0.001637
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## r 0.9009 2.2113 3.2261 4.1447 4.9328
## K 188.1699 202.3151 210.4414 219.6754 242.9445
## sigma 0.3583 0.5187 0.6283 0.7344 0.9425
theta <- summary(traces)$quantiles[c("r","K","sigma","N.0"),'50%']
simulate(parus,params=theta,nsim=10,as.data.frame=TRUE,include.data=TRUE) %>%
ggplot(mapping=aes(x=time,y=P,group=sim,alpha=(sim=="data")))+
scale_alpha_manual(name="",values=c(`TRUE`=1,`FALSE`=0.2),
labels=c(`FALSE`="simulation",`TRUE`="data"))+
geom_line()
If we desire an estimate of the smoothed state variables, i.e., the distribution of \(N_t\,\vert\,P_{1960},\dots,P_{1986}\) for \(t=1960,\dots,1986\), we can use the filter.traj command to extract samples from this distribution. Assuming the pMCMC chains above have converged, and after discarding the burn-in and thinning, these are, as Andrieu et al. (2010) showed, samples from the smoothing distribution. The following codes extract one trajectory of \(N\) per retained MCMC sample (filter.traj), discard the burn-in and thin (rep > 1000 & rep %% 50 == 0), and plot the median and equal-tailed 95% credible interval at each time point.
chains %>% filter.traj() %>% melt() %>%
subset(rep > 1000 & rep %% 50 == 0) %>%
dcast(L1+rep+time~variable) %>%
ddply(~time,summarize,
prob=c(0.025,0.5,0.975),
q=quantile(N,prob)) %>%
mutate(qq=mapvalues(prob,from=c(0.025,0.5,0.975),to=c("lo","med","hi"))) %>%
dcast(time~qq,value.var='q') %>%
ggplot()+
geom_ribbon(aes(x=time,ymin=lo,ymax=hi),alpha=0.5,fill='blue')+
geom_line(aes(x=time,y=med),color='blue')+
geom_point(data=parus.dat,aes(x=year,y=P),color='black',size=2)+
labs(y="N")
Andrieu, C., A. Doucet, and R. Holenstein. 2010. Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. B 72:269–342.
Ionides, E. L., C. Bretó, and A. A. King. 2006. Inference for nonlinear dynamical systems. Proc. Natl. Acad. Sci. U.S.A. 103:18438–18443.
Ionides, E. L., D. Nguyen, Y. Atchadé, S. Stoev, and A. A. King. 2015. Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proc. Natl. Acad. Sci. U.S.A. 112:719–724.
McCleery, R. H., and C. M. Perrins. 1991. Effects of predation on the numbers of great tits Parus major. Pages 129–147 in Bird Population Studies: Relevance to Conservation and Management. Oxford University Press, Oxford.